-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a panic related to updating alloc taskgroups #5805
Conversation
When an alloc is stopped, test that we don't update the job found in alloc with new job that is no longer relevent for this alloc.
Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.
@@ -230,18 +230,18 @@ func (s *StateStore) UpsertPlanResults(index uint64, results *structs.ApplyPlanR | |||
return err | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we used to update the Job field on all allocs (including stopped) in 0.9.1. The plan applier used to add both stopped and new allocs here [1] , and the state store would set the job from the plan to all allocs [2].
So something else must have changed in tandem to cause this panic in 0.9.2. Were you able to determine if any other additional changes besides the plan applier changes contributed to this?
[1] https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/plan_apply.go#L168
[2] https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/state/state_store.go#L191
Was there something in allocrunner that changed as well in 0.9.2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 0.9.1, DenormalizeAllocationJobs
would set alloc.Job
if it's not set already and it's not terminal: https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/structs/funcs.go#L275-L286 .
I confirmed this behavior by adding some logging statements in various places. I can demo it offline.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This fixes a panic in the client code in 0.9.3 that occur if a name of a job's TaskGroup changed. The cause being that server updated the stopped alloc with updated job so the alloc job task group no longer matched the actual task group the alloc is running.
In a sense, this is a follow up of #5717 but for stopped allocs instead of pre-empted allocs.
The test captures the failing condition in https://travis-ci.org/hashicorp/nomad/jobs/543933168 .
Also, I added some defensive code so that the alloc runner uses the tasks map that it initialized internally when starting the alloc rather than rely on alloc job field not getting modified.